
caddyhttp: Use sendfile(2) by implementing ReadFrom on wrappers #5022

Merged
merged 2 commits into master from readfrom on Sep 7, 2022

Conversation

flga
Collaborator

@flga flga commented Sep 6, 2022

Doing so allows for splice/sendfile optimizations when available. Since splice can now be leveraged as well, reverse proxy might also benefit, though I haven't confirmed it.
Fixes #4731

To check whether sendfile is really being called, we can construct two files; any file over 512 bytes should use sendfile over HTTP/1 (net/http copies the first 512 bytes with ordinary writes so it can sniff the Content-Type, and only what remains after that is handed to sendfile).

$ wc -c *.html
 512 nosendfile.html
 513 sendfile.html

(logs trimmed for brevity)

$ sudo strace -f -e sendfile ./caddy file-server --root . --listen 127.0.0.1:8080 --access-log
strace: Process 137302 attached
[...]
2022/09/06 21:56:45.335 INFO    Caddy serving static files on 127.0.0.1:8080
2022/09/06 21:56:50.505 INFO    http.log.access handled request {"request": {"proto": "HTTP/1.1", "method": "GET", "uri": "/nosendfile.html", "size": 512, "status": 200}}
[pid 137301] sendfile(7, 8, NULL, 1)    = 1
2022/09/06 21:57:00.281 INFO    http.log.access handled request {"request": {"proto": "HTTP/1.1", "method": "GET", "uri": "/sendfile.html", "size": 513, "status": 200}}

Since this change works under the covers, I've added some tests to ensure it won't break.

It would be nice if we could fix ResponseWriterWrapper in a less blunt way. Currently, if the underlying writer does not support ReadFrom, we call io.Copy (again). Unfortunately, we're fairly constrained in what we can do without breaking everything everywhere, so this seemed like the most reasonable, backwards-compatible option. The ideal solution would be not to rely on embedding a ResponseWriterWrapper, but to create one at runtime that implements exactly and exclusively what the underlying writer exposes.
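For anyone reading along without the diff open, here is a minimal sketch of the pattern (the type and package names are illustrative, not the PR's actual identifiers): the wrapper exposes ReadFrom, forwards it when the underlying writer supports it, and otherwise falls back to a plain io.Copy against the unwrapped writer.

package sketch

import (
	"io"
	"net/http"
)

// responseWriterSketch stands in for Caddy's wrapper type.
type responseWriterSketch struct {
	http.ResponseWriter
}

// ReadFrom lets io.Copy and http.ServeContent hand the source straight to
// the underlying writer, which for plain TCP connections can use
// sendfile(2)/splice(2). If the underlying writer has no ReadFrom, fall
// back to copying against the unwrapped writer (the "blunt" part above).
func (w responseWriterSketch) ReadFrom(r io.Reader) (int64, error) {
	if rf, ok := w.ResponseWriter.(io.ReaderFrom); ok {
		return rf.ReadFrom(r)
	}
	return io.Copy(w.ResponseWriter, r)
}

Because the wrapper itself always implements io.ReaderFrom, io.Copy will call this method instead of allocating its own intermediate buffer.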

ResponseRecorder is now aware of ReadFrom too, since it needs to spy on it to gather the response size.
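And a similarly hedged sketch of the size-spying idea, again with made-up names and ignoring the recorder's buffering mode: forward ReadFrom where possible so the sendfile path stays available, and add the returned byte count to the tally either way.

package sketch

import (
	"io"
	"net/http"
)

// recorderSketch stands in for a size-tracking response recorder.
type recorderSketch struct {
	http.ResponseWriter
	size int64
}

// ReadFrom records how many bytes were written regardless of which copy
// path is taken.
func (rr *recorderSketch) ReadFrom(r io.Reader) (int64, error) {
	var n int64
	var err error
	if rf, ok := rr.ResponseWriter.(io.ReaderFrom); ok {
		n, err = rf.ReadFrom(r)
	} else {
		n, err = io.Copy(rr.ResponseWriter, r)
	}
	rr.size += n
	return n, err
}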

@flga flga requested a review from mholt September 6, 2022 22:14
@francislavoie francislavoie added the optimization 📉 Performance or cost improvements label Sep 6, 2022
@francislavoie francislavoie added this to the v2.6.0 milestone Sep 6, 2022
@francislavoie
Member

Awesome! I'm curious how much of a throughput benefit Caddy will see from this change!

@mholt
Member

mholt commented Sep 7, 2022

Holy shnikies -- I think it works:

# (before patch)
$ wrk -t12 -c400 -d10s http://127.0.0.1:1234/index.html
Running 10s test @ http://127.0.0.1:1234/index.html
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.32ms   13.39ms 215.51ms   81.90%
    Req/Sec     8.11k     1.10k   19.61k    78.08%
  968720 requests in 10.04s, 35.33GB read
Requests/sec:  96460.94
Transfer/sec:      3.52GB

# (after patch)
$ wrk -t12 -c400 -d10s http://127.0.0.1:1234/index.html
Running 10s test @ http://127.0.0.1:1234/index.html
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.64ms    7.39ms  77.27ms   82.77%
    Req/Sec    12.34k     1.75k   19.47k    68.00%
  1476888 requests in 10.06s, 53.86GB read
Requests/sec: 146868.09
Transfer/sec:      5.36GB

Where index.html is just the Caddy homepage, and my Caddyfile is:

:1234

root ./test
file_server

Bravo, @flga -- you've nearly doubled the performance of the file server over HTTP.

Interestingly, I also see performance improvements with TLS:

# (before patch)
$ wrk -t12 -c400 -d10s https://localhost/index.html
Running 10s test @ https://localhost/index.html
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.29ms   22.99ms 433.93ms   90.40%
    Req/Sec     6.63k     0.98k    9.87k    74.24%
  769211 requests in 10.09s, 28.08GB read
Requests/sec:  76236.31
Transfer/sec:      2.78GB

# (after patch)
$ wrk -t12 -c400 -d10s https://localhost/index.html
Running 10s test @ https://localhost/index.html
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.19ms   20.59ms 424.09ms   93.98%
    Req/Sec     8.34k     1.58k   17.76k    74.69%
  963389 requests in 10.10s, 35.17GB read
Requests/sec:  95389.21
Transfer/sec:      3.48GB

which I would not expect to happen. But I was able to repeat these results. This might be good news for production sites.

I still want to collect profiles during these tests to see where the differences are when sendfile() is used versus when it's not. I might do that later tonight or tomorrow.

@flga Can I invite you to our developers Slack? It's for Caddy maintainers, sponsors, developers, etc. Would love to have you hang out with us there. Just let me know which email to send the invite to.

@francislavoie
Member

Tried it out for myself as well! I can confirm essentially the same thing:

HTTP Without patch:

Requests/sec:  35762.96
Transfer/sec:      1.30GB
Requests/sec:  35393.53
Transfer/sec:      1.29GB

HTTP With patch:

Requests/sec:  44787.90
Transfer/sec:      1.63GB
Requests/sec:  47676.41
Transfer/sec:      1.74GB

HTTPS Without patch:

Requests/sec:  29501.75
Transfer/sec:      1.08GB
Requests/sec:  27661.70
Transfer/sec:      1.01GB

HTTPS With patch:

Requests/sec:  33021.70
Transfer/sec:      1.21GB
Requests/sec:  32188.66
Transfer/sec:      1.17GB

@mholt
Member

mholt commented Sep 7, 2022

Thanks for verifying, Francis!

As Francis suggested in Slack, the improvements with TLS are probably due to ReadFrom() being an optimization in and of itself, as fewer buffers are allocated for the copying. It probably doesn't use sendfile but you still get some optimization with ReadFrom()!

@mholt
Member

mholt commented Sep 7, 2022

CPU profiles were almost identical.

Heap profiles were also pretty similar, but with strikingly fewer allocations and copying across buffers in the patched version (as expected):

[image: profile001, heap profile comparison]

@flga

It would be nice if we could fix ResponseWriterWrapper in a less blunt way. Currently, if the underlying writer does not support ReadFrom, we call io.Copy (again). Unfortunately, we're fairly constrained in what we can do without breaking everything everywhere, so this seemed like the most reasonable, backwards-compatible option. The ideal solution would be not to rely on embedding a ResponseWriterWrapper, but to create one at runtime that implements exactly and exclusively what the underlying writer exposes.

I agree, I just don't know how to do that... Still, I feel like by now we've implemented the most crucial interfaces. I'm open to ideas on better ways to do this, though.

Thanks for the test case by the way!

The code change looks good and is just about exactly how I would have done it. Better, even.

I'll do a final review soon and maybe tweak a thing or two, but either way we'll get this merged very shortly. Thank you for this great patch!

@mholt
Member

mholt commented Sep 7, 2022

This has the green light to be merged any time -- so feel free. I am curious how this might affect reverse proxy (both the HTTP transport and the FastCGI transport), as I haven't had a chance to test those yet. If we need to add a ReadFrom or WriteTo somewhere, we can do that. Doesn't have to be in this PR; could come later.

@francislavoie
Member

I tried it with two Caddy instances, a file_server one and a reverse_proxy one with the file server as its upstream. I tested the proxy instance before and after the patch and I'm not seeing any difference (within the margin of error).

@mholt
Member

mholt commented Sep 7, 2022

Verified that this does not affect reverse proxy. (Which is fine, maybe a later patch can address that.)

Our reverse proxy is derived from the standard library's net/http/httputil reverse proxy, which implements its own copyBuffer() function, derived from the io package (io.Copy(), io.CopyBuffer(), etc. all use it). Interestingly, io.copyBuffer does use ReadFrom and WriteTo when available. However, httputil.copyBuffer does NOT, even though much of the remaining code is similar.

Other differences between the two are primarily about error handling.

The omission of this optimization in httputil appears to be intentional: golang/go#21814

However, just for kicks and giggles I added it back into ours to see what the difference is, even though error handling is "incorrect" when the optimization is used:

func (h Handler) copyBuffer(dst io.Writer, src io.Reader, buf []byte) (int64, error) {
	if wt, ok := src.(io.WriterTo); ok {
		return wt.WriteTo(dst)
	}
	if rt, ok := dst.(io.ReaderFrom); ok {
		return rt.ReadFrom(src)
	}
	// ...

Proxying the same payload as my test above, I got about 35k-36k req/sec. With this patch and re-adding the ReadFrom call, I observed about 40k req/sec. Not a huge boost, but a repeatable result, and still an improvement.

Go's sendfile support only works for TCP connections with an *os.File on one side, even though sendfile(2) supports any file descriptors. It seems unfortunate that we can't copy from one TCP socket to another (unless TCPConns do assert as *os.File? I doubt it?) -- maybe I'm wrong, but that's what it seems like.
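To make that concrete, here is a small, hedged illustration (hypothetical address and file name, not Caddy code) of when the copy is eligible for sendfile: the destination's dynamic type must be *net.TCPConn and the source an *os.File; wrapping either side hides the concrete type and forces an ordinary userspace copy.

package main

import (
	"bufio"
	"io"
	"net"
	"os"
)

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:8080") // hypothetical listener
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	f, err := os.Open("index.html") // hypothetical file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Eligible: dst's dynamic type is *net.TCPConn and src is an *os.File,
	// so the copy can be handed to the kernel via sendfile(2).
	if _, err := io.Copy(conn, f); err != nil {
		panic(err)
	}

	// Rewind so the second copy has something to send.
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		panic(err)
	}

	// Not eligible: the bufio.Reader hides the *os.File, so the copy falls
	// back to read/write through a userspace buffer.
	if _, err := io.Copy(conn, bufio.NewReader(f)); err != nil {
		panic(err)
	}
}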

So anyway, if this patch were to affect reverse proxy, we'd just need to call ReadFrom in our copyBuffer. It doesn't use sendfile (unless we implement that manually -- which I'd be open to reviewing), but it does reduce allocations/copies. But then, we have the problem of less control over error handling, which is why httputil doesn't just use io.copyBuffer. Hmm.

So I just thought I'd mention this here.

This patch still LGTM!

flga and others added 2 commits September 7, 2022 20:43
caddyhttp: ensure ResponseWriterWrapper and ResponseRecorder use ReadFrom if the underlying response writer implements it. Doing so allows for splice/sendfile optimizations when available.
@flga flga merged commit dd9813c into master Sep 7, 2022
@flga flga deleted the readfrom branch September 7, 2022 20:14
@mholt mholt changed the title caddyhttp: ensure ResponseWriterWrapper and ResponseRecorder use ReadFrom if the underlying response writer implements it. caddyhttp: Use sendfile(2) by implementing ReadFrom on wrappers Sep 7, 2022
@flga
Collaborator Author

flga commented Sep 7, 2022

Hey, those are some nice gains! I was expecting something more modest, around the 10% mark, but I won't complain 😄

Go's sendfile support only works for TCP connections with an *os.File on one side, even though sendfile(2) supports any file descriptors. It seems unfortunate that we can't copy from one TCP socket to another (unless TCPConns do assert as *os.File? I doubt it?) -- maybe I'm wrong, but that's what it seems like.

sendfile doesn't work with sockets; that'd be a job for splice, which TCPConn is also aware of. However, splice requires a TCPConn on both sides, so it's really only applicable when proxying transparently; if we need to MITM, we lose that ability.

So it kind of ends up in sendfile territory: cool, but only useful in what I would expect to be a minority of Caddy setups (I'd wager most leverage at least TLS). See the sketch below.
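For completeness, a hedged sketch of the transparent-proxy case (illustrative function, not Caddy code): when both ends really are *net.TCPConn, io.Copy can take the splice(2) path on Linux, whereas terminating TLS in the middle forces the bytes through userspace for the crypto.

package sketch

import (
	"io"
	"net"
)

// proxyRaw shuttles bytes between two plain TCP connections. On Linux,
// TCPConn.ReadFrom can use splice(2) here because both source and
// destination are kernel sockets; with a *tls.Conn on either side the
// copy drops back to ordinary buffered reads and writes.
func proxyRaw(dst, src *net.TCPConn) error {
	_, err := io.Copy(dst, src)
	return err
}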

@mholt re: Slack, I can DM you my email on Twitter, @flga_

@mholt
Member

mholt commented Sep 8, 2022

@flga Sure, DM sent.

Ah right, I forgot about the differences between sendfile and splice.

Once we've written the headers from the backend, it might very well be possible to use splice to finish copying the response body. I'd love to review and test that! (I know it's HTTP-only, but I bet we have more HTTP reverse proxies than HTTP file servers out there, since reverse proxying is very common in internal environments where TLS isn't needed.)
